An IR Approach for Translating New Words from Nonparallel, Comparable Texts
Abstract
In recent years, there has been phenomenal growth in the amount of online text material available from the World Wide Web, the largest information repository known. Various traditional information retrieval (IR) techniques, combined with natural language processing (NLP) techniques, have been retargeted to enable efficient access to the WWW: search engines, indexing, relevance feedback, query term and keyword weighting, document analysis, document classification, etc. Most of these techniques aim at efficient online search for information already on the Web. Meanwhile, the corpus linguistics community regards the WWW as a vast potential source of corpus material. It is now possible to download a large amount of text with automatic tools when one needs to compute, for example, a list of synonyms, or to download domain-specific monolingual texts by specifying a keyword to a search engine and then use these texts to extract domain-specific terms. It remains to be seen how we can also make use of multilingual texts as NLP resources. In the years since the appearance of the first papers on using statistical models for bilingual lexicon compilation and machine translation (Brown et al., 1993; Brown et al., 1991; Gale and Church, 1993; Church, 1993; Simard et al., 1992), a large amount of human effort and time has been invested in collecting parallel corpora of translated texts. Our goal is to alleviate this effort and enlarge the scope of corpus resources by looking into monolingual, comparable texts. Texts of this type are known as nonparallel corpora. Such nonparallel, monolingual texts should be much more prevalent than parallel texts. However, previous attempts at using nonparallel corpora for terminology translation were constrained by the inadequate availability of same-domain, comparable texts in electronic form.
The nonparallel texts obtained from the LDC or university libraries were often restricted, and were usually out of date as soon as they became available. For new word translation, the timeliness of corpus resources is a prerequisite, as is the continuous and automatic availability of nonparallel, comparable texts in electronic form. Data collection effort should not inhibit the actual translation effort. Fortunately, the World Wide Web now provides a daily increase of fresh, up-to-date multilingual material, together with the archived versions, all easily downloadable by software tools running in the background. It is possible to specify the URL of a newspaper's online site, along with start and end dates, and automatically download all the daily newspaper material between those dates. In this paper, we describe a new method which combines IR and NLP techniques to extract new word translations from automatically downloaded English-Chinese nonparallel newspaper texts.
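As a rough illustration of the date-range download described above, the following Python sketch enumerates one archive URL per day between a start and an end date. It is not the authors' tool: the function name `daily_archive_urls`, the URL template, and the `news.example.com` host are all hypothetical, and the sketch assumes the newspaper exposes its archive under a date-based path. Fetching the pages is left out.

```python
from datetime import date, timedelta


def daily_archive_urls(url_template, start, end):
    """Yield one archive URL per calendar day from start to end, inclusive.

    url_template is a hypothetical pattern with {year}, {month}, and {day}
    fields; a real newspaper site may organize its archive differently.
    """
    if end < start:
        raise ValueError("end date precedes start date")
    day = start
    while day <= end:
        yield url_template.format(year=day.year, month=day.month, day=day.day)
        day += timedelta(days=1)


# Hypothetical template; the host and path layout are assumptions.
TEMPLATE = "https://news.example.com/archive/{year:04d}/{month:02d}/{day:02d}/"

urls = list(daily_archive_urls(TEMPLATE, date(1997, 12, 30), date(1998, 1, 2)))
for url in urls:
    print(url)
```

Each generated URL could then be passed to any HTTP client running in the background, with the downloaded pages stored per language to build the nonparallel corpus.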
Similar resources
Exploiting Comparable Corpora with TER and TERp
In this paper we present an extension of a successful simple and effective method for extracting parallel sentences from comparable corpora and we apply it to an Arabic/English NIST system. We experiment with a new TERp filter, along with WER and TER filters. We also report a comparison of our approach with that of (Munteanu and Marcu, 2005) using exactly the same corpora and show performance g...
Finding the Better Indexing Units for Chinese Information Retrieval
In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams had been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances. In this study, we carried out more experiments to find the better way to index Chinese texts. First, we investi...
The Effect of Lexicon-based Debates on the Felicity of Lexical Equivalents in Translating Literary Texts by Iranian EFL Learners
This study was an attempt to investigate the effect of lexicon-based debates on the felicity of lexical equivalents in translating literary texts by Iranian EFL learners. To fulfill the purpose of this study, 59 university students, majoring in English Translation, were randomly assigned to the experimental and control groups from a total of 73 students based on their performance on a mock TOE...
Material Development and English for Academic Purposes Word Lists: A Reductionist Approach
Nagy (1988) states that vocabulary is a prerequisite factor in comprehension. Drawing upon a reductionist approach and having in mind the prospects for material development, this study aimed at creating an English for Academic Purposes Word List (EAPWL). The corpus of this study was compiled from a corpus containing 6,479 pages of texts, 2,081,678 tokens (running words) and 63,825 types (...
On the Use of Comparable Corpora to Improve SMT Performance
We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel texts to translate the source side of the nonparallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniq...
Publication date: 1998